# Polars: DataFrames in Rust

Polars is a DataFrame library for Rust. It is based on Apache Arrow's memory model. Apache Arrow provides very cache-efficient columnar data structures and is becoming the de facto standard for columnar data.

This means that Polars data structures can be shared zero-copy with processes in many different languages.
## Tree Of Contents

- Cookbooks
- Data Structures
- SIMD
- API
- Compile times
- Performance and string data
- Custom allocator
- Config with ENV vars
- Compile for WASM
- User Guide

## Cookbooks

See examples in the cookbooks.
## Data Structures

The base data structures provided by Polars are `DataFrame`, `Series`, and `ChunkedArray<T>`. We will provide a short, top-down view of these data structures.
### DataFrame

A `DataFrame` is a 2-dimensional data structure backed by `Series`; it can be seen as an abstraction over `Vec<Series>`. Operations that can be executed on a `DataFrame` are very similar to what is done in a SQL-like query. You can `GROUP`, `JOIN`, `PIVOT`, etc. The closest Arrow equivalent to a `DataFrame` is a RecordBatch, and Polars provides zero-copy coercion.
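As a rough sketch of what this looks like in code (the column names here are made up, and method names may differ slightly between Polars versions):

```rust
use polars::prelude::*;

fn example() -> Result<DataFrame> {
    // A DataFrame is conceptually a Vec<Series> of equal-length columns.
    let fruit = Series::new("fruit", &["apple", "apple", "pear"]);
    let weight = Series::new("weight", &[3i32, 7, 10]);
    let df = DataFrame::new(vec![fruit, weight])?;

    // A SQL-like query: GROUP BY fruit, SUM(weight).
    df.groupby("fruit")?.select("weight").sum()
}
```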
### Series

`Series` are the type-agnostic columnar data representation of Polars. They provide many operations out of the box, many via the Series struct and the SeriesTrait trait. Whether an operation is provided by a `Series` depends on the operation: if it can be done without knowing the underlying columnar type, the operation is probably provided by the `Series`; if not, you must downcast to the typed data structure that is wrapped by the `Series`. That is the `ChunkedArray<T>`.
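A minimal sketch of that pattern, assuming an `i32` column (the downcast accessors are named after the dtype, e.g. `i32`, `f64`, `utf8`):

```rust
use polars::prelude::*;

fn example(s: &Series) -> Result<()> {
    // Type-agnostic: Series can do this without knowing the dtype.
    let _total: Option<i32> = s.sum();

    // Type-specific: downcast to the wrapped ChunkedArray first.
    let ca: &Int32Chunked = s.i32()?;

    // Iterating yields Option<i32>; None encodes a missing value.
    let _values: Vec<Option<i32>> = ca.into_iter().collect();
    Ok(())
}
```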
### ChunkedArray

`ChunkedArray<T>` are wrappers around an Arrow array that can contain multiple chunks, e.g. `Vec<dyn ArrowArray>`. These are the root data structures of Polars, and implement many operations. Most operations are implemented by traits defined in `chunked_array::ops`, or on the ChunkedArray struct itself.
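For illustration, a sketch of using a `ChunkedArray` directly (the constructor and trait names are from roughly this era of Polars and may have changed since):

```rust
use polars::prelude::*;

fn example() {
    // A typed columnar array; under the hood it may consist of multiple chunks.
    let ca = UInt32Chunked::new_from_slice("foo", &[1, 2, 3]);

    // Aggregations come from traits in chunked_array::ops (here: ChunkAgg).
    let _total: Option<u32> = ca.sum();

    // Arithmetic is implemented directly on the typed array.
    let _doubled = &ca * 2;
}
```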
## SIMD

Polars / Arrow uses packed_simd to speed up kernels with SIMD operations. SIMD is an optional `feature = "simd"` and requires a nightly compiler. If you don't need SIMD, Polars runs on stable!
## API

Polars supports an eager and a lazy API, and strives to make them both equally capable. The eager API is similar to pandas and is easy to get started with. The lazy API is similar to Spark and builds a query plan that will be optimized. This may be less intuitive but could improve performance.

### Eager

Read more in the pages of the data structures and traits described above: `ChunkedArray`, `Series`, and `DataFrame`.

### Lazy

Unlock the full potential with lazy computation. This allows query optimizations and gives Polars the full query context so that the fastest algorithm can be chosen.
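A sketch of what a lazy query could look like (requires the `lazy` feature; `col` and `lit` come from the lazy prelude, and the column names are made up):

```rust
use polars::prelude::*;

fn lazy_example(df: DataFrame) -> Result<DataFrame> {
    df.lazy()
        // Nothing is executed yet; we are only building a logical plan.
        .filter(col("weight").gt(lit(5)))
        .groupby(vec![col("fruit")])
        .agg(vec![col("weight").sum()])
        // On collect, the optimizer rewrites the plan and it is executed.
        .collect()
}
```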
## Compile times

A DataFrame library typically consists of:

- tons of features
- a lot of datatypes

Both of these really put a strain on compile times. To keep Polars lean, we make both opt-in, meaning that you only pay the compilation cost if you need it.
### Compile times and opt-in features

The opt-in features are (not including dtype features):

- `lazy` - Lazy API
- `lazy_regex` - Use regexes in column selection
- `random` - Generate arrays with randomly sampled values
- `ndarray` - Convert from `DataFrame` to `ndarray`
- `temporal` - Conversions between Chrono and Polars for temporal data types
- `strings` - Extra string utilities for `Utf8Chunked`
- `object` - Support for generic ChunkedArrays called `ObjectChunked<T>` (generic over `T`). These are downcastable from Series through the Any trait.
- Performance related:
    - `simd` - SIMD operations (nightly only)
    - `performant` - ~40% faster ChunkedArray creation, but may lead to unexpected panics if an iterator incorrectly sets a size_hint
- IO related:
    - `serde` - Support for serde serialization and deserialization. Can be used for JSON and other serde-supported serialization formats.
    - `parquet` - Read Apache Parquet format
    - `json` - JSON serialization
    - `ipc` - Arrow's IPC format serialization
    - `decompress` - Automatically infer the compression of csv files and decompress them. Supported compressions: zip, gzip
- `DataFrame` operations:
    - `pivot` - Pivot operation on `DataFrame`s
    - `sort_multiple` - Allow sorting a `DataFrame` on multiple columns
    - `rows` - Create `DataFrame`s from rows and extract rows from `DataFrame`s
    - `downsample` - Downsample operation on `DataFrame`s
    - `asof_join` - Join as-of, to join on nearest keys instead of an exact equality match
    - `cross_join` - Create the Cartesian product of two DataFrames
    - `groupby_list` - Allow groupby operations on keys of type List
- `Series` operations:
    - `is_in` - Check for membership in `Series`
    - `zip_with` - Zip two Series / ChunkedArrays
    - `round_series` - Round underlying float types of `Series`
    - `repeat_by` - Repeat an element in an array N times, where N is given by another array
    - `is_first` - Check if an element is the first unique value
    - `is_last` - Check if an element is the last unique value
    - `checked_arithmetic` - Checked arithmetic, returning `None` on invalid operations
    - `dot_product` - Dot/inner product on Series and Expressions
    - `concat_str` - Concatenate string data in linear time
    - `reinterpret` - Utility to reinterpret bits to signed/unsigned
    - `take_opt_iter` - Take from a Series with `Iterator<Item=Option<usize>>`
    - `mode` - Return the most frequently occurring value(s)
    - `cum_agg` - cumsum, cummin, cummax aggregations
    - `rolling_window` - Rolling window functions, like rolling_mean
    - `interpolate` - Interpolate None values
    - `extract_jsonpath` - Run jsonpath queries on Utf8Chunked
    - `list` - List utilities
    - `rank` - Ranking algorithms
    - `moment` - Kurtosis and skew statistics
- `DataFrame` pretty printing (choose one or none, but not both):
    - `plain_fmt` - No overflowing (shorter compile times)
    - `pretty_fmt` - Cell overflow (increased compile times)
- `row_hash` - Utility to hash DataFrame rows to a UInt64Chunked
### Compile times and opt-in data types

As mentioned above, a Polars `Series` is a wrapper around `ChunkedArray<T>` without the generic parameter `T`. To get rid of the generic parameter, all the possible values of `T` are compiled into `Series`. This gets more expensive the more types you want for a `Series`. In order to reduce compile times, we have decided to default to a minimal set of types and make the remaining `Series` types opt-in.

Note that if you get strange compile-time errors, you probably need to opt in to that `Series` dtype.
The opt-in dtypes are:

| data type    | feature flag      |
|--------------|-------------------|
| DateType     | dtype-date        |
| DatetimeType | dtype-datetime    |
| TimeType     | dtype-time        |
| Int8Type     | dtype-i8          |
| Int16Type    | dtype-i16         |
| UInt8Type    | dtype-u8          |
| UInt16Type   | dtype-u16         |
| Categorical  | dtype-categorical |

Or you can choose one of the preconfigured presets:

- `dtype-full` - all opt-in dtypes
- `dtype-slim` - slim preset of opt-in dtypes
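For illustration, opting in happens through Cargo features; a hypothetical Cargo.toml entry could look like this:

```toml
[dependencies.polars]
version = "*"  # pick a real version here
features = ["lazy", "temporal", "dtype-date", "dtype-u8"]
```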
## Performance and string data

Large string data can really slow down your queries. Read more in the performance section.
## Custom allocator

A DataFrame library naturally does a lot of heap allocations. It is recommended to use a custom allocator. Mimalloc, for instance, shows a significant performance gain in runtime as well as memory usage.

### Usage

```rust
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```

### Cargo.toml

```toml
[dependencies]
mimalloc = { version = "*", default-features = false }
```
## Config with ENV vars

- `POLARS_FMT_NO_UTF8` -> use ASCII tables instead of UTF-8
- `POLARS_FMT_MAX_COLS` -> maximum number of columns shown when formatting DataFrames
- `POLARS_FMT_MAX_ROWS` -> maximum number of rows shown when formatting DataFrames
- `POLARS_TABLE_WIDTH` -> width of the tables used during DataFrame formatting
- `POLARS_MAX_THREADS` -> maximum number of threads used to initialize the thread pool (on startup)
- `POLARS_VERBOSE` -> print logging info to stderr
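These are ordinary environment variables, so besides setting them in the shell you can also set them programmatically; a small sketch:

```rust
fn main() {
    // Must be set before Polars formats a DataFrame.
    std::env::set_var("POLARS_FMT_MAX_ROWS", "10");
    std::env::set_var("POLARS_VERBOSE", "1");
}
```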
## Compile for WASM

To be able to pretty print a `DataFrame` in `wasm32-wasi` you need to patch the `prettytable-rs` dependency. If you add this snippet to your Cargo.toml you can compile and pretty print when compiling to the `wasm32-wasi` target.

```toml
[patch.crates-io]
prettytable-rs = { git = "https://github.com/phsym/prettytable-rs", branch = "master" }
```
## User Guide

If you want to read more, check the User Guide.
## Re-exports

```rust
pub use polars_io as io;
pub use polars_lazy as lazy;
```